GLIMPSEINDEX(l) UNIX System V (October 11, 1995) GLIMPSEINDEX(l)
NAME
glimpseindex 3.0 - index whole file systems to be searched
by glimpse
OVERVIEW
Glimpse (which stands for GLobal IMPlicit SEarch) is an
indexing and query system that allows you to search through
all your files very quickly. Glimpseindex is the indexing
program for glimpse. Glimpse supports most of agrep's
options (agrep is our powerful version of grep) including
approximate matching (e.g., finding misspelled words),
Boolean queries, and even some limited forms of regular
expressions. It is used in the same way, except that you
don't have to specify file names. So, if you are looking
for a needle anywhere in your file system, all you have to
do is say glimpse needle and all lines containing needle
will appear preceded by the file name. See man glimpse for
details on how to use glimpse.
Glimpseindex provides three indexing options: a tiny index
(2-3% of the total size of all files), a small index (7-8%)
and a medium-size index (20-30%). Search times are normally
better with larger indexes. To index all your files, you
say glimpseindex ~ for a tiny index (where ~ stands for the
home directory), glimpseindex -o ~ for a small index, and
glimpseindex -b ~ for a medium-size index.
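As a rough illustration of the percentages above, the following Python sketch estimates how much disk space each kind of index might take for a given amount of text. The helper name and the table are ours and are not part of glimpse; the ranges are simply the ones quoted above, and actual sizes will vary with the data.

```python
# Index-size ranges quoted in the overview, as whole percentages of the
# total size of the indexed files. The empty flag means the default
# (tiny) index. This table is illustrative, not read from glimpse.
INDEX_SIZE_PERCENT = {
    "": (2, 3),      # tiny index (default)
    "-o": (7, 8),    # small index
    "-b": (20, 30),  # medium-size index
}


def estimate_index_size(total_bytes, flag=""):
    """Return a (low, high) estimate of the index size in bytes.

    Hypothetical helper: it only applies the quoted percentages.
    """
    low_pct, high_pct = INDEX_SIZE_PERCENT[flag]
    return total_bytes * low_pct // 100, total_bytes * high_pct // 100
```

For 100,000,000 bytes of text this gives roughly 2-3 million bytes for the tiny index and 20-30 million bytes for the medium-size one.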
Mail glimpse-request@cs.arizona.edu to be added to the
glimpse mailing list. Mail glimpse@cs.arizona.edu to report
bugs, ask questions, discuss tricks for using glimpse, etc.
(this is a moderated mailing list with very little traffic,
mostly announcements). An HTML version of this manual page
can be found at
http://glimpse.cs.arizona.edu:1994/glimpseindexhelp.html
Also, see the glimpse developers' home page at
http://glimpse.cs.arizona.edu:1994/
SYNOPSIS
glimpseindex [ -abEfFiInos -w number -dD filename(s) -H
directory -M number -S number ] directory_name[s]
INTRODUCTION
Glimpseindex builds an index of all text files in all the
directories specified and all their subdirectories
(recursively). It is also possible to build several
separate indexes (possibly even overlapping). The simplest
way to index your files is to say
glimpseindex ~
The index consists of several files (described in detail
below), all with the prefix .glimpse_ stored in the user's
home directory (unless otherwise specified with the -H
option). Files with one of the following suffixes are not
indexed: ".o", ".gz", ".Z", ".z", ".hqx", ".zip", ".tar".
(Unless the -z option is used, see below.) In addition,
glimpseindex attempts to determine whether a file is a text
file and does not index files that it thinks are not text
files. Numbers are not indexed unless the -n option is
used. It is possible to prevent specified files from being
indexed by adding their names to the .glimpse_exclude file
(described below). The -o option builds a larger index
(typically by a factor of 2-3), allowing for a faster search
(1-5 times faster). The -b option builds an even larger index
and allows an even faster search. There is an incremental
indexing option -f, which updates an existing index by
determining which files have been created or modified since
the index was built and adding them to the index (see -f).
Glimpseindex is reasonably fast, taking about 20 minutes to
index 100MB from scratch (on a SUN Sparc 5) and 2-4 minutes
to update an existing index. (Your mileage may vary.) It is
also possible to increment the index by adding a specific
file (the -a option).
Once an index is built, searching for pattern is as easy as
saying
glimpse pattern
(See man glimpse for all glimpse's options and features.)
A DETAILED DESCRIPTION OF GLIMPSEINDEX
Glimpse does not automatically index files. You have to
tell it to do it. This can be done manually, but a better
way is to set it to run every night. It is probably a good
idea to run glimpseindex manually for the first time to be
sure it works properly. The following is a simple script to
run glimpseindex every night. We assume that this script is
stored in a file called glimpse.script:
glimpseindex -w 1000 ~ >& .glimpse_out
at -m 0300 glimpse.script
(It might be interesting to collect all the outputs of
glimpse by changing >& to >>& so that the file .glimpse_out
maintains a history. In this case the file must be created
before the first time >>& is used. If you use ksh, write
'> .glimpse_out 2>&1' instead of '>& .glimpse_out'.)
Glimpseindex stores the names of all the files that it
indexed in the file .glimpse_filenames. Each file is listed
by its full path name as obtained at the time the files were
indexed. For example, /usr1/udi/file1. Glimpse uses this
full name when it performs the search, so the name must
match the current name. This may become a problem when the
indexing and the search are done from different machines
(e.g., through NFS), which may cause the path names to be
different. For example, /tmp_mnt/R/xxx/xxx/usr1/udi/file1.
(The same is true for several other .glimpse files. See
below.)
Glimpseindex does not follow symbolic links unless they are
explicitly included in the .glimpse_include file (described
below).
Glimpseindex makes an effort to identify non-text files such
as binary files, compressed files, uuencoded files,
postscript files, binhex files, etc. These files are
automatically not indexed. In addition, all files whose
names end with `.o', `.gz', `.Z', `.z', `.hqx', `.zip', or
`.tar' will not be indexed (unless they are specifically
included in .glimpse_include - see below).
The options for glimpseindex are as follows:
-a adds the given file[s] and/or directories to an
existing index. Any given directory will be traversed
recursively and all files will be indexed (unless they
appear in .glimpse_exclude; see below). Using this
option is generally much faster than indexing
everything from scratch, although in rare cases the
index may not be as good. If for some reason the index
is full (which can happen unless -o or -b are used)
glimpseindex -a will produce an error message and will
exit without changing the original index.
-b builds a medium-size index (20-30% of the size of all
files), allowing faster search. This option forces
glimpseindex to store an exact (byte level) pointer to
each occurrence of each word (except for some very
common words belonging to the stop list).
-B uses a hash table that is 4 times bigger (256K entries
instead of 64K) to speed up indexing. The memory usage
will increase typically by about 2 MB. This option is
only for indexing speed; it does not affect the final
index.
-d filename(s)
deletes the given file(s) from the index.
-D filename(s)
deletes the given file(s) from the list of file names,
but not from the index. This is much faster than -d,
and the file(s) will not be found by glimpse. However,
the index itself will not become smaller.
-E does not run a check on file types. Glimpse normally
attempts to exclude non-text files, but this attempt is
not always perfect. With -E, glimpseindex indexes all
files, except those that are specifically excluded in
.glimpse_exclude and those whose file names end with
one of the excluded suffixes.
-f incremental indexing. glimpseindex scans all files and
adds to the index only those files that were created or
modified after the current index was built. If there
is no current index or if this procedure fails,
glimpseindex automatically reverts to the default mode
(which is to index everything from scratch). This
option may create an inefficient index for several
reasons, one of which is that deleted files are not
really deleted from the index. Unless changes are
small, mostly additions, and -o is used, we suggest
using the default mode as much as possible.
-F Glimpseindex receives the list of files to index from
standard input.
-H directory
Put or update the index and all other .glimpse files
(listed below) in "directory". The default is the home
directory. When glimpse is run, the -H option must be
used to direct glimpse to this directory, because
glimpse assumes that the index is in the home directory
(see also the -H option in glimpse).
-i Make .glimpse_include (see GLIMPSEINDEX FILES) take
precedence over .glimpse_exclude, so that, for example,
one can exclude everything (by putting *) and then
explicitly include files.
-I Instead of indexing, only show (print to standard out)
the list of files that would be indexed. It is useful
for filtering purposes. ("glimpseindex -I dir |
glimpseindex -F" is the same as "glimpseindex dir".)
-M x Tells glimpseindex to use x MB of memory for temporary
tables. The more memory you allow the faster
glimpseindex will run. The default is x=2. The value
of x must be a positive integer. Glimpseindex will
need more memory than x for other things, and
glimpseindex may perform some 'forks', so you'll have
to experiment if you want to use this option. WARNING:
If x is too large you may run out of swap space.
-n Index numbers as well as text. The default is not to
index numbers. This is useful when searching for dates
or other identifying numbers, but it may make the index
very large if there are lots of numbers. In general,
glimpseindex strips away any non-alphabetic character.
For example, the string abc123 will be indexed as abc
if the -n option is not used and as abc123 if it is
used. Glimpse provides warnings (in .glimpse_messages)
for all files in which more than half the words that
were added to the index from that file had digits in
them (this is an attempt to identify data files that
should probably not be indexed). One can use the
.glimpse_exclude file to exclude data files or any
other files. (See GLIMPSEINDEX FILES.)
-o Build a small index rather than tiny (meaning 7-9% of
the sizes of all files - your mileage may vary)
allowing faster search. This option forces
glimpseindex to allocate one block per file (a block
usually contains many files). A detailed explanation
of how blocks affect glimpse can be found in the
glimpse article. (See also LIMITATIONS.)
-s supports structured queries. This option was added to
support the Harvest project and it is applicable mostly
in that context. See STRUCTURED QUERIES below for more
information and also http://harvest.cs.colorado.edu for
more information about the Harvest project.
-S k The number k determines the size of the stop-list. The
stop-list consists of words that are too common and are
not indexed (e.g., 'the' or 'and'). Instead of having
a fixed stop-list, glimpseindex figures out the words
that are too common for every index separately. The
rules are different for the different indexing options.
The tiny index contains all words (the savings from a
stop-list are too small to bother). For the small index
(-o), the number k is a percentage threshold. A word
will be in the stop list if it appears in at least k%
of all files. The default value is 80%. (If there are
fewer than 256 files, then the stop-list is not
maintained.) The medium index (-b) counts all
occurrences of all words, and a word is added to the
stop-list if it appears at least k times per MByte.
The default value is 500. A query that includes a stop
list word is of course less efficient. (See also
LIMITATIONS below.)
-w k Glimpseindex does a reasonable, but not a perfect, job
of determining which files should not be indexed.
Sometimes a large text file should not be indexed; for
example, a dictionary may match most queries. The -w
option stores in a file called .glimpse_messages (in
the same directory as the index) the list of all files
that contribute at least k new words to the index. The
user can look at this list of files and decide which
should or should not be indexed. The file
.glimpse_exclude contains files that will not be
indexed (see more below). We recommend setting k to
about 1000. This is not an exact measure. For
example, if the same file appears twice, then the
second copy will not contribute any new words to the
dictionary (but if you exclude the first copy and index
again, the second copy will contribute).
-z Allow customizable filtering, using the file
.glimpse_filters to run the programs listed there
for each match. The best example is
compress/decompress. If .glimpse_filters includes the
line
*.Z uncompress <
(separated by tabs) then before indexing any file that
matches the pattern "*.Z" (same syntax as the one for
.glimpse_exclude) the command listed is executed first
(assuming input is from stdin, which is why uncompress
needs <) and its output (assuming it goes to stdout) is
indexed. The file itself is not changed (i.e., it
stays compressed). Then if glimpse -z is used, the
same program is used on these files on the fly. Any
program can be used (we run 'exec'). For example, one
can filter out parts of files that should not be
indexed. Glimpseindex tries to apply all filters in
.glimpse_filters in the order they are given. For
example, if you want to uncompress a file and then
extract some part of it, put the decompression command
(the example above) first and then another line that
specifies the extraction. Note that this can slow down
the search because the filters need to be run before
files are searched.
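The stop-list rules described under -S above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not glimpseindex's own code: the function names are hypothetical, the word counts are assumed to have been collected already, and one MByte is assumed to mean 2**20 bytes.

```python
def small_index_stop_list(doc_freq, num_files, k=80):
    """Stop list for a small (-o) index (hypothetical helper).

    doc_freq maps each word to the number of files that contain it.
    A word goes on the stop list if it appears in at least k% of all
    files; with fewer than 256 files no stop list is maintained.
    """
    if num_files < 256:
        return set()
    return {w for w, n in doc_freq.items() if n * 100 >= k * num_files}


def medium_index_stop_list(occurrences, total_bytes, k=500):
    """Stop list for a medium (-b) index (hypothetical helper).

    occurrences maps each word to its total number of occurrences.
    A word goes on the stop list if it appears at least k times per
    MByte of indexed text (assumption: 1 MByte = 2**20 bytes).
    """
    megabytes = total_bytes / (1024 * 1024)
    return {w for w, n in occurrences.items() if n >= k * megabytes}
```

The tiny index has no counterpart here because, as noted above, it keeps all words.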
GLIMPSEINDEX FILES
All files used by glimpse are located at the directory(ies)
where the index(es) is (are) stored and have .glimpse_ as a
prefix. The first two files (.glimpse_exclude and
.glimpse_include) are optionally supplied by the user. The
other files are built and read by glimpse.
.glimpse_exclude
contains a list of files that glimpseindex is
explicitly told to ignore. In general, the syntax of
.glimpse_exclude/include is the same as that of agrep
(or any other grep). The lines in the .glimpse_exclude
file are matched to the file names, and if they match,
the files are excluded. Notice that agrep matches to
parts of the string! e.g., agrep /ftp/pub will match
/home/ftp/pub and /ftp/pub/whatever. So, if you want
to exclude /ftp/pub/core, you just list it, as is, in
the .glimpse_exclude file. If you put
"/home/ftp/pub/cdrom" in .glimpse_exclude, every file
name that matches that string will be excluded, meaning
all files below it. You can use ^ to indicate the
beginning of a file name, and $ to indicate the end of
one, and you can use * and ? in the usual way. For
example /ftp/*html will exclude /ftp/pub/foo.html, but
will also exclude /home/ftp/pub/html/whatever; if you
want to exclude files that start with /ftp and end with
html use ^/ftp*html$. Notice that putting a * at the
beginning or at the end is redundant (in fact, in this
case glimpseindex will remove the * when it does the
indexing). No other meta characters are allowed in
.glimpse_exclude (e.g., don't use .* or # or |). Lines
with * or ? must have no more than 30 characters.
Notice that, although the index itself will not be
indexed, the list of file names (.glimpse_filenames)
will be indexed unless it is explicitly listed in
.glimpse_exclude.
.glimpse_filters
See the description above for the -z option.
.glimpse_include
contains a list of files that glimpseindex is
explicitly told to include in the index even though
they may look like non-text files. Symbolic links are
followed by glimpseindex only if they are specifically
included here. The syntax is the same as the one for
.glimpse_exclude (see there). If a file is in both
.glimpse_exclude and .glimpse_include it will be
excluded unless -i is used.
.glimpse_filenames
contains the list of all indexed file names, one per
line. This is an ASCII file that can also be used with
agrep to search for a file name leading to a fast find
command. For example,
glimpse 'count#\.c$' ~/.glimpse_filenames
will output the names of all (indexed) .c files that
have 'count' in their name (including anywhere on the
path from the index). Setting the following alias in
the .login file may be useful:
alias findfile 'glimpse -h \!:1 ~/.glimpse_filenames'
.glimpse_index
contains the index. The index consists of lines, each
starting with a word followed by a list of block
numbers (unless the -o or -b options are used, in which
case each word is followed by an offset into the file
.glimpse_partitions where all pointers are kept). The
block/file numbers are stored in binary form, so this
is not an ASCII file.
.glimpse_messages
contains the output of the -w option (see above).
.glimpse_partitions
contains the partition of the indexed space into blocks
and, when the index is built with the -o or -b options,
some part of the index. This file is used internally
by glimpse and it is a non-ASCII file.
.glimpse_statistics
contains some statistics about the makeup of the index.
Useful for some advanced applications and customization
of glimpse.
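The matching rules given above for .glimpse_exclude and .glimpse_include can be approximated with the Python sketch below. It is an illustration only: the function names and the include_first parameter (standing in for the -i option) are ours, and we assume that * matches any run of characters, including "/", and that ? matches exactly one character. Symbolic links, the 30-character limit, and the suffix and text-file checks are not modelled.

```python
import re


def pattern_to_regex(line):
    """Translate one exclude/include line into a compiled regex.

    Unanchored lines match anywhere in the file name; ^ and $ anchor
    the match to the beginning and the end of the name.
    """
    anchored_start = line.startswith("^")
    anchored_end = line.endswith("$")
    body = line[1:] if anchored_start else line
    if anchored_end:
        body = body[:-1]
    parts = []
    for ch in body:
        if ch == "*":
            parts.append(".*")      # assumption: * may span "/"
        elif ch == "?":
            parts.append(".")       # assumption: ? is one character
        else:
            parts.append(re.escape(ch))
    regex = "".join(parts)
    if anchored_start:
        regex = r"\A" + regex
    if anchored_end:
        regex = regex + r"\Z"
    return re.compile(regex)


def matches(line, path):
    """True if the file name matches the line (substring semantics)."""
    return pattern_to_regex(line).search(path) is not None


def is_excluded(path, exclude_lines, include_lines=(), include_first=False):
    """Decide whether a file is left out of the index.

    A file listed in both files is excluded unless include_first is
    set, which mimics running glimpseindex with -i.
    """
    excluded = any(matches(l, path) for l in exclude_lines)
    included = any(matches(l, path) for l in include_lines)
    if excluded and included:
        return not include_first
    return excluded
```

With these definitions, the line /ftp/*html matches both /ftp/pub/foo.html and /home/ftp/pub/html/whatever, while ^/ftp*html$ matches only the first, as in the examples above.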
STRUCTURED QUERIES
Glimpse can search for Boolean combinations of
"attribute=value" terms by using the Harvest SOIF parser
library (in glimpse/libtemplate). To search this way, the
index must be made by using the -s option of glimpseindex
(this can be used in conjunction with other glimpseindex
options). For glimpse and glimpseindex to recognize
"structured" files, they must be in SOIF format. In this
format, each value is prefixed by an attribute-name with the
size of the value (in bytes) present in "{}" after the name
of the attribute. For example, the following lines are part
of an SOIF file:
type{17}: Directory-Listing
md5{32}: 3858c73d68616df0ed58a44d306b12ba
Any string can serve as an attribute name. Glimpse
"pattern;type=Directory-Listing" will search for "pattern"
only in files whose type is "Directory-Listing". The file
itself is considered to be one "object" and its name/url
appears as the first attribute with an "@" prefix; e.g.,
@FILE { http://xxx... }. The scope of Boolean operations
changes from records (lines) to whole files when structured
queries are used in glimpse (since individual query terms
can look at different attributes and they may not be
"covered" by the record/line). Note that glimpse can only
search for patterns in the value parts of the SOIF file:
there are some attributes (like the TTL, MD5, etc.) that are
interpreted by Harvest's internal routines. See
http://harvest.cs.colorado.edu/harvest/user-manual/ for more
detailed information about the SOIF format.
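To make the attribute syntax concrete, here is a small Python sketch that parses a single "name{size}: value" line like the two shown above and checks the declared size. It is not the Harvest SOIF parser: the function name is hypothetical, and we assume the value sits on one line and that sizes count UTF-8 bytes, whereas real SOIF values may span several lines.

```python
import re

# name{size}: value, with the value on the same line (assumption).
ATTRIBUTE_LINE = re.compile(r"^([^{}\s]+)\{(\d+)\}:\s*(.*)$")


def parse_attribute_line(line):
    """Return (name, value) for one SOIF-style attribute line.

    Raises ValueError if the line does not have the expected shape or
    if the declared size does not match the length of the value.
    """
    match = ATTRIBUTE_LINE.match(line)
    if match is None:
        raise ValueError("not an attribute line: %r" % line)
    name, size, value = match.group(1), int(match.group(2)), match.group(3)
    if len(value.encode("utf-8")) != size:
        raise ValueError("size mismatch for attribute %r" % name)
    return name, value
```

Applied to the first example line, this returns the pair ("type", "Directory-Listing"), since "Directory-Listing" is 17 bytes long.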
HOW TO DETERMINE THE INDEX TYPE
If you want to determine the type of an existing index,
check the first 3 lines of the file ".glimpse_index" (which
can be obtained by running "head -3 .glimpse_index"). These
lines always begin with "%". If the first line has the
string "1234567890" after the "%", it means that numbers
were indexed (glimpseindex -n); otherwise, it means that
numbers were not indexed. If the second line has a 0 after
the "%", then a tiny (default) index was created by
glimpseindex; if there is a negative integer after the "%",
then a medium sized index was created (glimpseindex -b); if
there is a positive integer after the "%", then a small
index was created (glimpseindex -o). In the latter two
cases, the absolute value of the integer tells you the
number of files that were indexed. On the third line, if the
"-s" option of glimpseindex was used to build an index for
structured queries, the positive integer after the "%" tells
you the number of attributes that were found; if not, the
third line just contains a "%0".
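The rules in this section can be applied mechanically. The Python sketch below is one way to do it; the function names and the returned dictionary keys are ours, and the code assumes only what is stated above about the first three lines. Because .glimpse_index is not an ASCII file, the reader opens it in binary mode and decodes just those three lines.

```python
def read_index_header(path):
    """Return the first three lines of an index file as text."""
    with open(path, "rb") as f:
        return [f.readline().decode("ascii", errors="replace")
                for _ in range(3)]


def describe_index(header_lines):
    """Interpret the three '%' header lines (hypothetical helper)."""
    first, second, third = (line.strip() for line in header_lines[:3])
    for line in (first, second, third):
        if not line.startswith("%"):
            raise ValueError("header line does not begin with '%'")
    size_field = int(second[1:])
    if size_field == 0:
        index_type = "tiny"
    elif size_field < 0:
        index_type = "medium (-b)"
    else:
        index_type = "small (-o)"
    attributes = int(third[1:])
    return {
        "numbers_indexed": "1234567890" in first[1:],
        "index_type": index_type,
        # The file count is only available for small and medium indexes.
        "files_indexed": abs(size_field) if size_field else None,
        "structured": attributes > 0,
        "attributes": attributes,
    }
```

A typical call would be describe_index(read_index_header(".glimpse_index")), run from the directory that holds the index.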
REFERENCES
1. U. Manber and S. Wu, "GLIMPSE: A Tool to Search Through
Entire File Systems," Usenix Winter 1994 Technical
Conference, San Francisco (January 1994), pp. 23-32.
Also, Technical Report #TR 93-34, Dept. of Computer
Science, University of Arizona, October 1993 (a
postscript file is available by anonymous ftp at
cs.arizona.edu:reports/1993/TR93-34.ps).
2. S. Wu and U. Manber, "Fast Text Searching Allowing
Errors," Communications of the ACM 35 (October 1992),
pp. 83-91.
SEE ALSO
agrep(1), ed(1), ex(1), glimpse(1), glimpseserver(1),
grep(1V), sh(1), csh(1).
LIMITATIONS
The index of glimpse is word based. A pattern that contains
more than one word cannot be found in the index. The way
glimpse overcomes this weakness is by splitting any multi-
word pattern into its set of words and looking for all of
them in the index. For example, glimpse 'linear
programming' will first consult the index to find all files
containing both linear and programming, and then apply agrep
to find the combined pattern. This is usually an effective
solution, but it can be slow for cases where both words are
very common, but their combination is not.
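The two-phase approach described above can be sketched in Python. This is a simplified model, not glimpse's implementation: the function names are hypothetical, the index maps each lower-cased word directly to a set of file names (glimpse's tiny index points to blocks of files instead), and a plain substring test stands in for agrep.

```python
import re

WORD = re.compile(r"[a-z]+")


def build_index(files):
    """Map each lower-cased word to the set of files containing it.

    files is a dict of file name -> text.
    """
    index = {}
    for name, text in files.items():
        for word in WORD.findall(text.lower()):
            index.setdefault(word, set()).add(name)
    return index


def search_phrase(phrase, files, index):
    """Find files containing a multi-word pattern in two phases."""
    words = WORD.findall(phrase.lower())
    if not words:
        return []
    # Phase 1: the index yields files that contain every word.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    # Phase 2: scan only the candidates for the combined pattern,
    # using the original (case-sensitive) phrase.
    return sorted(name for name in candidates if phrase in files[name])
```

The slow case mentioned above corresponds to phase 1 returning many candidates, almost all of which phase 2 then has to read and reject.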
The index of glimpse stores all patterns in lower case.
When glimpse searches the index it first converts all
patterns to lower case, finds the appropriate files, and
then searches the actual files using the original patterns.
So, for example, glimpse ABCXYZ will first find all files
containing abcxyz in any combination of lower and upper
cases, and then search these files directly, so only the
right cases will be found. One problem with this approach
is discovering misspellings that are caused by wrong cases.
For example, glimpse -B abcXYZ will first search the index
for the best match to abcxyz (because the pattern is
converted to lower case); it will find that there are
matches with no errors, and will go to those files to search
them directly, this time with the original upper cases. If
the closest match is, say AbcXYZ, glimpse may miss it,
because it doesn't expect an error. Another problem is
speed. If you search for "ATT", it will look at the index
for "att". Unless you use -w to match the whole word,
glimpse may have to search all files containing, for
example, "Seattle" which has "att" in it.
There is no size limit for simple patterns and simple
patterns with Boolean AND. More complicated patterns are
currently limited to approximately 30 characters. Lines are
limited to 1024 characters. Records are limited to 48K, and
may be truncated if they are larger than that. The limit of
record length can be changed by modifying the parameter
Max_record in agrep.h.
Each line in .glimpse_exclude or .glimpse_include that
contains a * or a ? must not exceed 30 characters in length.
Glimpseindex does not index words of size > 64.
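Several of the rules stated in this manual page (lower-case storage, the 64-character limit, and the treatment of digits with and without -n) can be combined into one small Python sketch. The function name and the index_numbers parameter are ours, and the exact way glimpseindex splits text into words is an assumption: here any run of letters (or of letters and digits when numbers are indexed) counts as a word.

```python
import re

MAX_WORD_LENGTH = 64  # words longer than this are not indexed


def index_words(text, index_numbers=False):
    """Return the set of words that would be indexed (a sketch).

    Without index_numbers (no -n), digits act as separators, so
    "abc123" contributes "abc". With index_numbers it contributes
    "abc123". Words are stored in lower case.
    """
    pattern = r"[A-Za-z0-9]+" if index_numbers else r"[A-Za-z]+"
    words = set()
    for token in re.findall(pattern, text):
        if len(token) <= MAX_WORD_LENGTH:
            words.add(token.lower())
    return words
```

Under these assumptions the string abc123 is indexed as abc by default and as abc123 when numbers are indexed, matching the example given for the -n option.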
A medium-size index (-b) may lead to actually slower query
times if the files are all very small.
Under -b, it may be impossible to make the stop list empty.
Glimpseindex uses the "sort" routine, and all
occurrences of a word appear at some point on one line.
Sort limits the size of lines it can handle (the value
depends on the platform; ours is 16KB). If the lines are
too big, the word is added to the stop list.
BUGS
Please send bug reports or comments to
glimpse@cs.arizona.edu.
AUTHORS
Udi Manber and Burra Gopal, Department of Computer Science,
University of Arizona, and Sun Wu, the National Chung-Cheng
University, Taiwan. (Email: glimpse@cs.arizona.edu)
(printed 11/3/95)